1 Introduction

Moving to a new city on a tight budget is challenging. A metropolis like London in particular has high rents and a competitive market, which makes it difficult to find accommodation at a reasonable price with exactly the attributes you are looking for. Sharing economy services like Airbnb have facilitated the search for a nice room rented out by a private host. The rooms and apartments available are furnished so that the user can settle right in. But how do you know if the price you are paying for your flat is actually a fair price?

Profits of both hosts and the platform itself have skyrocketed in the past years. A typical UK host earns around 3,000 pounds a year (Cox, 2017a). This profit ultimately comes from the user, who pays both the platform fee and the host’s profit margin out of their own pocket. If you are on a tight budget yourself, you want to pay a price close to the market average for the attributes important to you. This paper aims to create a model that forecasts the price per night a user will pay for an Airbnb matching their requirements, to facilitate the check whether the price of an apartment is indeed a fair price.

2 Description of the dataset

The dataset for this investigation covers all Airbnb offerings in London as of the 4th and 5th of March 2017. It contains 53,904 objects for 95 different variables. Its source is the website “Inside Airbnb - Adding data to the debate” (Cox, 2017b). This is an independent and non-commercial project that aims to examine the effect of Airbnb activities on urban development.

To focus this investigation on its actual goal of helping students find the right Airbnb, the initial dataset was reduced by a number of selections and filters. For example, only listings offering a private room and having at least three valid ratings were included. The resulting dataset has 7,020 objects for 78 variables and is described in the following.

2.1 Price

Table 1: Summary of Price Variable
Min Q1 Median Mean Q3 Max
8 35 45 50.06994 59 590

With price being the dependent variable of our investigation, it can be considered the most important one. The summary statistics show that 75 % of all Airbnbs are priced at 59 pounds per night or less. However, there are some severe outliers, ranging up to a maximum of 590 pounds.

This casts doubt on whether the price follows a normal distribution, which would be desirable for a later linear regression. In fact, the plot of the price on the left-hand side below shows no normal distribution. However, the plot on the right-hand side shows that the price looks almost normally distributed on a logarithmic scale.
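The effect of the logarithmic transformation can also be checked numerically. The following is a minimal sketch that uses simulated right-skewed prices rather than the Airbnb data; the helper `skewness()` and the simulation parameters are assumptions made for illustration only.

```r
# Simulated right-skewed prices; NOT the Airbnb data (assumption for illustration).
set.seed(42)
price <- round(rlnorm(7020, meanlog = log(45), sdlog = 0.4))
price <- pmax(price, 1)  # guard against log10(0)

# Hypothetical helper: sample skewness (third standardised moment).
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

price_log <- log10(price)

skew_raw <- skewness(price)
skew_log <- skewness(price_log)
print(c(raw = skew_raw, log10 = skew_log))

# Density estimates corresponding to the two panels of a raw vs. log10 plot.
dens_raw <- density(price)
dens_log <- density(price_log)
```

A skewness close to zero after the transformation is consistent with, but does not prove, approximate normality of the log10 price.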

Figure 1: Density of Price and Log10 of Price

2.2 Rent

With London being one of the most expensive cities to live in, rent can be considered the major cost of providing an Airbnb. Therefore, we would like to observe the relationship between rent and Airbnb price. However, the initial dataset holds no information on the regular rent at the location of an Airbnb. Searching among the big property websites such as Rightmove or Zoopla, we found a website called “Find Properly” (see Lokku Ltd., 2017), which utilizes the data from Zoopla and provides the rent and selling price for each of 217 zipcode areas, from BR1 to WD25. Using the zipcode, we were able to map the average weekly rent for 1 bed properties to every Airbnb.
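The mapping of rents to listings via the first half of the zipcode can be sketched as a simple left join. The data below is invented for illustration; only the column names `zip_first` and `mean_rent` follow the variable list in the appendix, and the way the first half is extracted (everything before the first space) is an assumption.

```r
# Toy data; the values are invented for illustration only.
listings <- data.frame(
  id      = 1:4,
  zipcode = c("SW1A 1AA", "E1 6AN", "WD25 7LR", "N1 9GU"),
  price   = c(80, 45, 30, 55),
  stringsAsFactors = FALSE
)
rent <- data.frame(
  zip_first = c("SW1A", "E1", "WD25"),
  mean_rent = c(600, 400, 250),  # average weekly rent, 1 bed (made-up numbers)
  stringsAsFactors = FALSE
)

# First half of the zipcode = everything before the first space (assumption).
listings$zip_first <- toupper(sub(" .*$", "", trimws(listings$zipcode)))

# Left join keeps listings without a rent match (mean_rent becomes NA).
merged <- merge(listings, rent, by = "zip_first", all.x = TRUE)
merged <- merged[order(merged$id), ]
print(merged)
```

Keeping unmatched listings with `all.x = TRUE` makes missing rent information visible instead of silently dropping rows.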

Figure 2: Mapping Rent Prices vs. Airbnb Prices

Mapping the mean rent and the logarithmic Airbnb prices according to their location reveals some of the expected relationship. Nevertheless, it also becomes clear that there is more to an Airbnb price than just the average rent in the particular neighbourhood.

2.3 Location

Table 2: Correlation Test of Distance to City Center and Price
P-Value Conf Low Estimate Conf High
cor 4.802688e-260 -0.414006 -0.3944327 -0.3744948

When choosing an Airbnb in London, people may consider its location, since location determines how convenient it is to travel around or live in London. In our model, we use the distance to the touristic city center - Piccadilly Circus - as a measure of the Airbnb location. It was calculated using the Haversine formula and the geographic coordinates of Piccadilly Circus (Longitude: -0.133869, Latitude: 51.510067).
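As an illustration of the distance calculation, the following is a minimal sketch of the Haversine formula in base R. The function name `haversine_km`, the mean Earth radius of 6371 km and the example listing coordinates are assumptions; only the coordinates of the city center are taken from the text.

```r
# Great-circle distance in km between points given in decimal degrees.
# The Earth radius of 6371 km is an assumption (mean radius).
haversine_km <- function(lon1, lat1, lon2, lat2, radius = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius * asin(pmin(1, sqrt(a)))
}

# Coordinates of the city center as given in the text.
center_lon <- -0.133869
center_lat <- 51.510067

# Example listing location (hypothetical coordinates).
d <- haversine_km(-0.0759, 51.5081, center_lon, center_lat)
print(d)
```

Because the function is vectorised, it can be applied to whole longitude and latitude columns at once to obtain a distance variable.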

The boxplot and the correlation test show that distance to the city center and price are significantly negatively correlated. Statistically speaking, the closer a listing is to the city center, the higher its price.

Figure 3: Mapping Rent Prices vs. Airbnb Prices

2.4 Reviews

In addition to written reviews, guests can give their hosts star ratings on the following parameters (see Airbnb Inc., 2017): overall experience, accuracy, cleanliness, communication, check in, location and value. Overall experience relates to the general impression of the guest and is only calculated for ads with at least three reviews. Accuracy asks how well the ad represented the real properties of the apartment. Cleanliness accounts for the tidiness of the flat. Check in and communication are both service-based: Was communication with the host before and during the stay sufficient, and was the check in process smooth or difficult? The location is evaluated based on security, comfort and attractiveness of the neighbourhood. Finally, value is a subjective measure of whether the guests believe that the apartment is worth the price paid - an interesting measure for our analysis.

While the guest gives ratings on a one-to-five-star scale, the data set transforms them to a scale from 1 to 10 for the subcategories and from 0 to 100 for the overall rating. As the table below shows, the average review is very high: either 9 or 10 for the subcategory scores and 92 for the overall score. The lowest observed values lie between 2 and 4 for the subcategories and at 20 for the overall rating. This means that ads with good ratings are overrepresented, which suggests that ads with bad reviews are unlikely to be booked and are therefore removed from the website. As the overall score is given separately by the guest rather than calculated from the subcategories, different subcategories have different effects on the overall rating. The overall score is only moderately correlated with location, communication and cleanliness, whereas accuracy, check in and value are strongly correlated with it. For the analysis, this implies a higher impact of the latter variables on the model and shows the necessity to analyse both the subcategories and the overall rating score, as they are given independently. The relation between the rating scores and price is weak: only location reaches a correlation of 0.30 with price, and all other categories stay at 0.14 or below.

Table 3: Summary of Review Scores
Name Minimum Maximum Mean Correlation_Rating Correlation_Price
Accuracy 2 10 9 0.77 0.09
Check In 2 10 10 0.78 0.14
Cleanliness 2 10 9 0.67 0.10
Communication 4 10 10 0.68 0.10
Location 3 10 9 0.54 0.30
Value 2 10 9 0.79 0.04
Overall 20 100 92 1.00 0.13
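A summary of this kind can be assembled with a few `sapply()` calls. The sketch below runs on simulated scores, not on the Airbnb data; the variable names and the way the scores are generated are assumptions, and only the output columns mirror the table above.

```r
set.seed(1)
n <- 500
# Simulated scores; NOT the Airbnb data.
value    <- sample(6:10, n, replace = TRUE)
location <- sample(6:10, n, replace = TRUE)
overall  <- pmin(100, round(10 * (0.7 * value + 0.3 * location) + rnorm(n, sd = 3)))
price    <- round(30 + 4 * location + rnorm(n, sd = 10))
scores <- data.frame(value, location, overall, price)

sub_cols <- c("value", "location", "overall")
summary_tab <- data.frame(
  Name               = sub_cols,
  Minimum            = sapply(scores[sub_cols], min),
  Maximum            = sapply(scores[sub_cols], max),
  Mean               = round(sapply(scores[sub_cols], mean)),
  Correlation_Rating = round(sapply(scores[sub_cols], cor, y = scores$overall), 2),
  Correlation_Price  = round(sapply(scores[sub_cols], cor, y = scores$price), 2),
  row.names = NULL
)
print(summary_tab)
```

The row for the overall score has a correlation of 1 with itself by construction, as in the table above.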

2.5 Property Characteristics

beds

The variable beds describes how many beds are available for guests. Intuitively, with more beds available, the price of an accommodation should be higher, all other parameters being equal. But how exactly does the number of beds influence the price?

summary(data_short$beds)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   1.000   1.155   1.000   8.000
cor.test(data_short$beds, data_short$price)
## 
##  Pearson's product-moment correlation
## 
## data:  data_short$beds and data_short$price
## t = 22.832, df = 7018, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2410411 0.2845945
## sample estimates:
##       cor 
## 0.2629518
ggplot(data_short, aes(x = beds)) + geom_bar() + labs(x = "The Number of Beds", y = "Count") 

ggplot(data_short, aes(x = as.factor(beds), y = price)) + geom_boxplot() + labs(x = "The Number of Beds", y = "Price") 

Although most of the accommodations have fewer than three beds, we can still observe a clear positive relation between the number of beds and price, which matches our expectation. In general, the more beds there are, the higher the price is. Since there are few observations with more than three beds, we cannot predict their price precisely.

2.6 Amenities

Figure 3: Mapping Rent Prices vs. Airbnb Prices

Airbnb includes some general information on the property such as the room type, the number of people that can be accommodated or the number of bathrooms. On top of these characteristics, Airbnb contains information on a wide range of amenities for every flat. These range from the availability of Internet and a TV up to a personal doorman or a pool. In order to analyse these, we introduced dummy variables for 53 different amenities, with 46 resulting in usable data, as well as a variable counting the total number of amenities.
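The construction of amenity dummy variables and of an amenity counter can be sketched as follows. The raw string format (braces, commas, optional quotes), the helper `parse_amenities()` and the `amen_` column prefix are assumptions for illustration; the toy listings are invented.

```r
# Toy listings; the amenity strings mimic a braces-and-commas format (assumption).
listings <- data.frame(
  id = 1:3,
  amenities = c('{TV,Internet,Kitchen,"Lock on Bedroom Door"}',
                '{Internet,Washer,Dryer}',
                '{}'),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split one amenity string into a clean character vector.
parse_amenities <- function(x) {
  x <- gsub('[{}"]', "", x)
  items <- trimws(strsplit(x, ",", fixed = TRUE)[[1]])
  items[nzchar(items)]
}

parsed <- lapply(listings$amenities, parse_amenities)
all_amenities <- sort(unique(unlist(parsed)))

# One 0/1 dummy column per amenity.
for (a in all_amenities) {
  col <- paste0("amen_", gsub("[^A-Za-z0-9]+", "_", a))
  listings[[col]] <- as.integer(vapply(parsed, function(p) a %in% p, logical(1)))
}

# Total number of amenities per listing.
listings$amenities_count <- lengths(parsed)
print(listings[, -2])
```

Dummy columns that are constant across all listings carry no information and could be dropped afterwards, which is one way some amenities may end up without usable data.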

We examined 7 amenities for their influence on the price: home essentials such as kitchens, TVs, dryers and washers, facilities like elevators, whether the environment is family/kid friendly, and whether there is a lock on the bedroom door. The table below compares the mean log10 price of accommodations with an amenity (x_mean) and without it (y_mean). Accommodations with TVs, elevators, dryers and washers are priced higher than those without, especially for TVs. According to the table, the same holds for family/kid-friendly accommodations, while accommodations with a kitchen are priced marginally lower; the kitchen difference is not statistically significant (p = 0.52). A lower price could be explained by such amenities being linked to more work and noise. Another interesting finding is that rooms without a lock may have a higher price than rooms with one, possibly because rooms with a lock are mainly located in less safe regions.

Table 4: Mean Log10 Price With and Without Selected Amenities
amenities p_vals x_mean y_mean diff_mean
Washer 3.314698e-02 1.662704 1.651315 0.011389410
TV 4.740326e-63 1.691456 1.622949 0.068506824
Family / Kid-Friendly 4.614522e-15 1.681312 1.647186 0.034125869
Dryer 2.721383e-47 1.700635 1.637045 0.063589686
Kitchen 5.209762e-01 1.660163 1.664692 -0.004528326
Elevator in Building 1.299341e-19 1.688023 1.648061 0.039962232
Lock on Bedroom Door 6.579133e-04 1.645645 1.663698 -0.018053830
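A comparison of this kind can be produced with one Welch t-test per amenity. The sketch below uses simulated data, not the Airbnb data, and the helper `compare_amenity()` is hypothetical; it assumes that x_mean refers to listings with the amenity and y_mean to listings without it.

```r
set.seed(7)
n <- 1000
# Simulated data; NOT the Airbnb data. Only `tv` has a built-in effect.
toy <- data.frame(
  tv      = rbinom(n, 1, 0.5),
  kitchen = rbinom(n, 1, 0.8)
)
toy$price_log <- 1.62 + 0.07 * toy$tv + rnorm(n, sd = 0.15)

# Hypothetical helper: Welch t-test of log10 price, with vs. without the amenity.
compare_amenity <- function(data, amenity, outcome = "price_log") {
  with_a    <- data[data[[amenity]] == 1, outcome]
  without_a <- data[data[[amenity]] == 0, outcome]
  tt <- t.test(with_a, without_a)
  data.frame(
    amenities = amenity,
    p_vals    = tt$p.value,
    x_mean    = mean(with_a),      # assumption: x = listings WITH the amenity
    y_mean    = mean(without_a),   # assumption: y = listings WITHOUT it
    diff_mean = mean(with_a) - mean(without_a)
  )
}

result <- do.call(rbind, lapply(c("tv", "kitchen"), compare_amenity, data = toy))
print(result)
```

When many amenities are tested this way, some small p-values are expected by chance alone, so borderline results should be read with caution.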

2.7 Offering Characteristics

In this part, we analyse attributes of the ad. We chose two variables that we intuitively expect to influence price_pp: whether the listing is instantly bookable and its cancellation policy.

To attract more customers, hosts sometimes allow their properties to be booked instantly. With respect to instant booking, there are two kinds of accommodation. In our dataset, TRUE means guests can book the desired property instantly, while FALSE means they have to discuss their plans with the host and wait for approval before they can book.

## Warning in cor.test.default(as.numeric(data_short$instant_bookable),
## data_short$price_log, : Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  as.numeric(data_short$instant_bookable) and data_short$price_log
## S = 5.7215e+10, p-value = 0.5196
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##         rho 
## 0.007686642

## 
##  Welch Two Sample t-test
## 
## data:  data_short[data_short[, "instant_bookable"] == "TRUE", "price_log"] and data_short[data_short[, "instant_bookable"] == "FALSE", "price_log"]
## t = 0.1975, df = 3076.5, p-value = 0.8435
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.008609714  0.010538443
## sample estimates:
## mean of x mean of y 
##  1.661290  1.660326

Although instant booking makes the booking process more convenient for both hosts and guests, only about 26.5% of hosts have adopted it.
Spearman’s rank correlation is very low (rho = 0.008, p = 0.52), so there is no clear monotonic relation between instant bookability and log price, and the t-test likewise finds no significant difference in mean log price (p = 0.84). The boxplot of price_pp nevertheless suggests that price_pp is in general higher if the accommodation cannot be booked instantly. This result makes sense in practice: where the rent is relatively high, hosts are more likely to forgo instant booking in order to protect their property.

cancellation_policy

In addition to instant booking, hosts also have the right to choose their own cancellation policy. The cancellation policy determines whether and how guests can be refunded. There are several cancellation policies from which hosts can choose, including flexible, moderate, strict and super strict. Under the flexible policy, guests may get a full refund if the reservation is cancelled within a limited period, mostly up to 24 hours prior to check in. Under the moderate policy, fees are fully refundable, but cancellation is required a longer time before check in. Under the strict policy, only 50% of fees may be refunded, up to 1 week prior to check in (see https://www.airbnb.co.uk/home/cancellation_policies#strict).

summary(data_short$cancellation_policy)
##        flexible        moderate          strict super_strict_30 
##            1998            2018            3004               0 
## super_strict_60 
##               0
cor.test(as.numeric(data_short$cancellation_policy), data_short$price)
## 
##  Pearson's product-moment correlation
## 
## data:  as.numeric(data_short$cancellation_policy) and data_short$price
## t = 6.2503, df = 7018, p-value = 4.337e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.05109825 0.09762617
## sample estimates:
##       cor 
## 0.0744027
ggplot(data_short, aes(x = cancellation_policy, y=price)) + geom_boxplot() + labs(x = "Cancellation Policy", y = "Price Per Person") 

Surprisingly, accommodations with a strict cancellation policy in general have a lower price than others. One possible reason might be that with a strict policy, hosts’ profits are more secure.

Next, we analyse whether there is any correlation between these two variables.

cor.test(as.numeric(data_short$instant_bookable), as.numeric(data_short$cancellation_policy))
## 
##  Pearson's product-moment correlation
## 
## data:  as.numeric(data_short$instant_bookable) and as.numeric(data_short$cancellation_policy)
## t = -11.253, df = 7018, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1560413 -0.1100834
## sample estimates:
##        cor 
## -0.1331339
ggplot(data = data_short, aes(x = cancellation_policy, fill = instant_bookable)) + geom_bar(position = "dodge") + labs(x = "Cancellation Policy", y = "Count") 

As we can see, a larger proportion of apartments with a strict cancellation policy are instant bookable, compared with those with other cancellation policies.
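Proportions of this kind can be read directly from a row-normalised contingency table, which complements the dodged bar chart. The following sketch uses invented counts, not the Airbnb data; only the variable names follow the dataset.

```r
# Toy data; the counts are invented for illustration only.
toy <- data.frame(
  cancellation_policy = factor(
    c(rep("flexible", 10), rep("moderate", 10), rep("strict", 10)),
    levels = c("flexible", "moderate", "strict")
  ),
  instant_bookable = c(rep(c(TRUE, FALSE), c(2, 8)),
                       rep(c(TRUE, FALSE), c(3, 7)),
                       rep(c(TRUE, FALSE), c(4, 6)))
)

counts <- table(toy$cancellation_policy, toy$instant_bookable)

# Row-wise proportions: share of (non-)instant-bookable listings per policy.
shares <- prop.table(counts, margin = 1)
print(round(shares, 2))
```

Comparing shares within each policy avoids the distortion that arises when the policies have very different numbers of listings.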

Furthermore, the Cartesian coordinates were calculated, using the instructions of Irawan (2014), to plot the data on maps provided by Lovelace & Cheshire (2014).

Bibliography

Airbnb Inc. (2017) How do star ratings work. [Online]. Available from: https://de.airbnb.com/help/article/1257/how-do-star-ratings-work.

Cox, J. (2017a) Airbnb: Surge in UK hosts over past year boosts local economies. The Independent. [Online]. Available from: http://www.independent.co.uk/news/business/news/airbnb-hosts-uk-surge-boost-local-economies-online-holiday-rental-london-southwest-northern-ireland-a7940451.html.

Cox, M. (2017b) Inside Airbnb - Adding data to the debate. [Online]. Available from: http://data.insideairbnb.com/united-kingdom/england/london/2017-03-04/data/listings.csv.gz.

Irawan, D.E. (2014) How to convert lat-long coordinates to UTM. [Online]. Available from: https://rpubs.com/dasaptaerwin/19879.

Lokku Ltd. (2017) London house prices by postcode. [Online]. Available from: https://www.findproperly.co.uk/london/postcode/#.WdvonHeZNn4.

Lovelace, R. & Cheshire, J. (2014) Introduction to visualising spatial data in R. National Centre for Research Methods Working Papers. [Online] 14 (03). Available from: https://github.com/Robinlovelace/Creating-maps-in-R.

Appendix

Column Numbers Name Description
1 price Price per Night as offered on Airbnb
2 zip_first First half of the London Zipcode
3 mean_rent Mean Rent for the given Zipcode as per Lokku Ltd. (2017)
4 distance Distance from Piccadilly Circus in km
5 - 6 east & north Geographic Cartesian coordinates required for map plotting
7 - 13 review_scores Average customer reviews from Airbnb
14 number_of_reviews Number of customer reviews
15 property_type
16 room_type
17 accommodates
18 bathrooms
19 bedrooms
20 beds
21 amenities_count
22 - 74 amen Dummy Variables for the various amenities
75 minimum_nights
76 instant_bookable
77 cancellation_policy

Imperial College Business School